Running RMarkdown in RStudio

All of the code that we will work through this semester will be stored as RMarkdown files, which have a .Rmd extension. These files are great because they allow us mix code and descriptions within the same file. Reading these notes will give you brief overview of how this works; we will practice hands-on in class.

When opening an RMarkdown file in RStudio, you should see a window similar to this (it will be slightly different on Windows and depending on your screen size):

On the left is the actual file itself. Some output and other helpful bits of information are shown on the right. There is also a Console window, which we generally will not need. I have minimized it in the graphic.

Notice that the file has parts that are on a white background and other parts that are on a grey background. The white parts correspond to text and the grey parts to code. In order to run the code, and to see the output, click on the green play button on the right of each block.

When you run code to read or create a new data set, the data will be listed in the Environment tab in the upper right hand side of RStudio:

Clicking on the data will open a spreadsheet version of the data that you can view to understand the structure of your data and to see all of the columns that are available for analysis:

Going back to the RMarkdown file by clicking on the tab on the upper row, we can see how graphics work in R. We have written some code to produce a scatter plot. When the code is run, the plot displays inside of the markdown file:

Make sure to save the notebook frequently. However, notice that only the text and code itself is saved. The results (plots, tables, and other output) are not automatically stored. This is actually helpful because the code is much smaller than the results and it helps to keep the file sizes small. If you would like to save the results in a way that can be shared with others, you need to knit the file by clicking on the Knit button (it has a ball of yarn icon) at the top of the notebook. After running all the code from scratch, it will produce an HTML version of our script that you can open in a web browser:

In fact, the notes that you are currently reading were created with RMarkdown files that are knitted to HTML.

Markdown

Markdown is a lightweight markup language used to format text for the web. RMarkdown documents combine markdown prose with R code. Here are some of the key things you will need to know how to do with markdown:

  1. Create headers using the # symbol. The number of # symbols determines the size of the header.
# Big header

The quick brown fox jumped over the lazy dog.

## Smaller header

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
  1. Distinguish named objects using backticks:
The `seq` function generates a sequence of numbers.
  1. Create lists using * or - for bullet points and numbers for ordered lists:
* First item
* Second item
* Third item
  1. Create links using square brackets for the text and parentheses for the URL:
[Google](https://www.google.com)

You will see all of these elements (and more) in the .Rmd notebooks that you will complete for homework.

For more information, see the R Markdown guide.

Running R Code

We now want to give a very brief overview of how to run R code. We will now only show snippets of R code and the output rather than a screen shot of the entire RStudio session. Though, know that you should think of each of the snippets as occurring inside of one of the grey boxes in an RMarkdown file.

In one of its most basic forms, R can be used as a fancy calculator. For example, we can divide 12 by 4:

12 / 4
## [1] 3

We can also store values by creating new objects within R. To do this, use the <- (arrow) symbol, which is called the “assignment operator.” For example, we can create a new object called mynum with a value of 8 by:

mynum <- 3 + 5

We can now use our new object mynum exactly the same way that we we would use the number 8. For example, adding it to 1 to get the number nine:

mynum + 1
## [1] 9

Two things to note about object names:

# overwrite values to mynum:
mynum <- 10
mynum <- 15
mynum <- -10
# call the object:
mynum
## [1] -10

Object names must start with a letter, but can also use underscores and periods. This semester, we will use only lowercase letters and underscores for object names. That makes it easier to read and easier to remember what you have called things.

Running functions

A function in R is something that takes a number of input values and returns an output value. Generally, a function will look something like this:

function_name(arg1 = input1, arg2 = input2)

Where arg1 and arg2 are the names of the inputs (“arguments”) to the function (they are fixed) and input1 and input2 are the values that we will assign to them. The number of arguments is not always two, however. There may be any number of arguments, including zero. Also, there may be additional optional arguments that have default values that can be modified.

Let us look at an example function: seq. This function returns a sequence of numbers. We will can give the function two input arguments: the starting point from and the ending point to.

seq(from = 1, to = 7)
## [1] 1 2 3 4 5 6 7

The function returns a sequence of numbers starting from 1 and ending at 7 in increments of 1. The return values are shown (in this document) right below the code block. Note that you can also pass arguments by position, in which case we use the default ordering of the arguments. Here is the same code but without the names:

seq(1, 7)
## [1] 1 2 3 4 5 6 7

There is also an optional argument by that controls the spacing between each of the numbers. By default it is equal to 1, but we can change it to spread the point out by half spaces.

seq(from = 1, to = 7, by = 0.5)
##  [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0

We will learn how to use numerous functions in the coming notes.

Getting help

When you forget how to use a function (we all do!), you can look up the documentation like so:

?seq

This also works:

help(seq)

This will pull up the documentation for the seq function in the Help tab in RStudio. You can also search for functions in the Help tab by typing in the search bar.

Tabular data

In these notes we will be working with data that is stored in a tabular format. Let’s start with the vocabulary of tabular data:

  • A variable (or feature) is a quantity, quality, or property that you can measure.
  • A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
  • An observation is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. We’ll sometimes refer to an observation as a data point.
  • Tabular data is a set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.
    from R for Data Science (2023)

Here is an example of a tabular data set of food types, which has nine rows and five columns. Each row tells us the nutritional properties contained in 100 grams of a particular type of food.

Every row of the data set represents a particular object in our data set, each of which we call an observation. In our food type example, each individual food corresponds to a specific observation:

The columns in a tabular data set represent the measurements that we record for each observation. These measurements are called variables or features. In our example data set, we have five features which record the name of the food type, the food group that the food falls into, the number of calories in a 100g serving, the amount of sodium (mg) in a 100g serving, and the amount of vitamin A (as a percentage of daily recommended value) in a 100g serving.

A larger version of this data set, with more food types and nutritional facts, is included in the course materials. We will make extensive use of this data set in the following notes as a common example for creating visualizations, performing data manipulation, and building models. In order to read in the data set we use a function called read_csv and pass it a description of where the file is located relative to where this script is stored. The data is called foods.csv and is stored in the folder data. The following code will load the foods data set into R, save it as an object called food, and prints out the first several rows:

food <- read_csv(file = "../data/food.csv")
food
## # A tibble: 61 × 17
##    item       food_group calories total_fat sat_fat cholesterol sodium carbs
##    <chr>      <chr>         <dbl>     <dbl>   <dbl>       <dbl>  <dbl> <dbl>
##  1 Apple      fruit            52       0.1   0.028           0      1 13.8 
##  2 Asparagus  vegetable        20       0.1   0.046           0      2  3.88
##  3 Avocado    fruit           160      14.6   2.13            0      7  8.53
##  4 Banana     fruit            89       0.3   0.112           0      1 22.8 
##  5 Chickpea   grains          180       2.9   0.309           0    243 30.0 
##  6 String Be… vegetable        31       0.1   0.026           0      6  7.13
##  7 Beef       meat            288      19.5   7.73           87    384  0   
##  8 Bell Pepp… vegetable        26       0     0.059           0      2  6.03
##  9 Crab       fish             87       1     0.222          78    293  0.04
## 10 Broccoli   vegetable        34       0.3   0.039           0     33  6.64
## # ℹ 51 more rows
## # ℹ 9 more variables: fiber <dbl>, sugar <dbl>, protein <dbl>, iron <dbl>,
## #   vitamin_a <dbl>, vitamin_c <dbl>, wiki <chr>, description <chr>,
## #   color <chr>

Notice that the display shows that there are a total of 61 rows and 17 features. The first 10 rows and 10 columns are shown. At the bottom, the names of the additional feature names are given. As described above, if you run this RStudio, you can view a full tabular version of the data set by clicking on the data set name in the Environment tab. The abbreviations <chr> and <dbl> tell us which features are characters (item, food_type, wiki, description, and color) and which are numbers (all the others).

If you prefer to type, the following command does the same thing:

# note the capital V in View
View(food)

Many of the examples in the following notes will make use of this foods data set to demonstrate new concepts. Another related data set that will be also be useful for illustrating several concepts contains the prices of various food items for over 140 years. We can read it into R using similar block of code, namely:

food_prices <- read_csv(file = "../data/food_prices.csv")
food_prices
## # A tibble: 146 × 14
##     year   tea sugar peanuts coffee cocoa wheat   rye  rice  corn barley
##    <dbl> <dbl> <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1  1870  129.  151.    203.   88.1  78.8  88.1 103.   83.5 121.    103.
##  2  1871  132.  167.    222.  109.   66.7 118.  105.   84.5  88.4   130.
##  3  1872  134.  162.    189.  140.   71.6 122.  102.   92.9  69.2   125.
##  4  1873  136.  154.    179.  173.   65.8 116.  106.   91.0  67.1   166.
##  5  1874  146.  153.    231.  187.   69.9 113.  126.   99.6 128.    174.
##  6  1875  149.  150.    197.  176.   69.4 110.  116.   85.8 127.    161.
##  7  1876  150.  160.    172.  184.   80.7 114.  106.   95.3  91.2   132.
##  8  1877  149.  189.    153.  198.   87.8 144.   97.0 108.   94.5   125.
##  9  1878  150.  165.    160.  169.   96.0 115.   91.6 114.   82.2   121.
## 10  1879  144.  158.    133.  149.  108.  118.  113.  110.   78.7   124.
## # ℹ 136 more rows
## # ℹ 3 more variables: pork <dbl>, beef <dbl>, lamb <dbl>

Here, each observation is a year. Features correspond to specific types of food. Notice that this is different than the foods data set, in which the food items were observations.

Formatting

It is very important to properly format your code in a consistent way. Even though the code may run without errors and produce the desired results, it is extremely important to make sure that your code is well-formatted to make it easier to read and debug. We will follow the following guidelines:

It will make your life a lot easier if you get used to these rules right from the start. We will practice and review this in class.

Keyboard Shortcuts

For those of you who prefer to use keyboard shortcuts, there are two essential shortcuts, and many optional ones.

The first essential shortcut is the Command Palette. Cmd + shift + p on Mac or Ctrl + shift + p on Windows. This will open the command palette, which allows you to search for any command in RStudio. If you memorize this command, you don’t have to memorize the other commands.

The second is the shortcut to run the current line of code. This is Cmd + Enter on Mac or Ctrl + Enter on Windows. This will run the current line of code in the console. If you have multiple lines of code selected, it will run all of them.

Here is the complete keyboard shortcut list in the RStudio documentation.

Homework Questions

At at then end of each set of notes, such as this one, will be a short set of questions or activities to complete before the next class. Bring written solutions with you to class.

  1. Make sure you have R, RStudio, and all of the packages installed. If you are still having trouble with anything, please let me know during class.

  2. On a piece of paper, make an example of a tabular data set with five rows and three columns. This can capture any type of information you would like. (If you don’t have any ideas, try your “to do” list.) We will share these together in class.

  3. Give each of the columns of your data set names. Follow the variable name rules described above.

  4. Once you have finished reading and completing the items above, make sure you have responded to the survey.